The public gathers information from news media and shares it via social platforms when it is useful. Analysing news articles is therefore helpful in understanding the public's interests. However, making sense of the news is extremely difficult because (1) a huge volume of articles is published every day across multiple media sites, (2) extracting topics and information from each article is not easy, and (3) what people consider important and what the media consider important can differ. This project aims to analyse local news headlines published in 2020 and provide readers with useful snapshots of news trends. The scope of this project is to visualise the trends of published articles over time, by topic and by media site. In addition, unusual patterns in news articles will be surfaced using time series anomaly detection techniques.
Analysing engagement metrics (the number of Likes, Shares, and Comments on social media) for each article helps us understand the public's interests. Newswhip, for example, has used engagement figures to predict how likely articles are to go viral. By visualising daily patterns of engagement figures by media site and by topic, we can recognise anomalous patterns and public interests more easily.
The raw data set was extracted from Prrt, which provides Facebook engagement metrics. The data set contains news headlines, media sites, and dates from 1 January 2020 to 31 December 2020. These articles were published by local media sites, a mix of mainstream (Straits Times, Channel News Asia, AsiaOne) and non-mainstream (Mothership, MustShareNews, The Independent) sites. From this raw data (224k articles), my team decided to filter out articles with fewer than 10 interactions. Doing this reduces the data set by 131,235 articles (close to 58%) and removes articles that were: (1) removed quickly after A/B testing, (2) removed quickly after editorial corrections, or (3) republished from international news wire agencies and therefore had little to no engagement. This leaves 93,166 articles for further analysis. Furthermore, we will use the results of topic modelling to observe trends and detect anomalous patterns.
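The filtering step can be sketched as below. A toy tibble stands in for the real 224k-row extract, and `Total` is the total-interactions column shown later in the skim output.

```r
library(dplyr)

# Toy stand-in for the raw Prrt extract (the real one has ~224k rows).
raw <- tibble::tibble(
  Headline = c("A", "B", "C", "D"),
  Total    = c(5, 10, 3, 120)   # Facebook interactions per article
)

# Keep only articles with at least 10 total interactions; this drops
# quickly-removed articles and low-engagement wire-agency reposts.
article_kept <- raw %>% filter(Total >= 10)
nrow(article_kept)   # 2
```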
The libraries below were used in this analysis.
- tidyverse: data manipulation and exploration
- plotly: interactive web graphics
- tibbletime/tibble, timetk: data manipulation for time series
- ggstatsplot: statistical plots for group comparisons
- anomalize: anomaly detection
packages = c('tidyverse', 'readxl', 'dplyr', 'skimr', 'tibble',
             'ggstatsplot', 'ggpubr', 'corrplot', 'seriation', 'dendextend',
             'tibbletime', 'anomalize', 'timetk', 'multcomp', 'devtools', 'car',
             'DT', 'reshape2', 'plotly', 'ggforce', 'lubridate', 'tidyquant', 'ggplot2')

# Install any missing packages, then load them all
for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}
Import the data set which contains topic labels (93,166 articles).
Skim the imported data set. As seen below, it contains 93,166 articles, posted by 6 media sites (Source), covering 20 unique topics (Topic), from “2020-01-01” to “2020-12-31”. We will use the “Total” field for analysing engagement metrics.
| Name | article_raw |
| Number of rows | 93166 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 5 |
| Date | 1 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Headline | 0 | 1 | 4 | 206 | 0 | 90945 | 0 |
| Link | 0 | 1 | 34 | 214 | 0 | 93165 | 0 |
| Source | 0 | 1 | 3 | 25 | 0 | 6 | 0 |
| Cleaned | 4 | 1 | 2 | 168 | 0 | 88810 | 0 |
| Topic | 0 | 1 | 5 | 21 | 0 | 20 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| Date | 0 | 1 | 2019-12-31 | 2020-12-31 | 2020-06-12 | 367 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Doc_ID | 0 | 1 | 115604.79 | 62212.55 | 1 | 65974.25 | 114221.5 | 168194.8 | 224121 | ▅▆▇▅▆ |
| Total | 0 | 1 | 1331.35 | 5700.90 | 10 | 38.00 | 223.0 | 884.0 | 685790 | ▇▁▁▁▁ |
| Likes | 0 | 1 | 871.38 | 3799.74 | 0 | 20.00 | 146.0 | 572.0 | 384518 | ▇▁▁▁▁ |
| Shares | 0 | 1 | 258.56 | 1200.00 | 0 | 11.00 | 37.0 | 146.0 | 209241 | ▇▁▁▁▁ |
| Comments | 0 | 1 | 201.41 | 970.54 | 0 | 3.00 | 26.0 | 130.0 | 92031 | ▇▁▁▁▁ |
article_raw$Date <- as.Date(article_raw$Date)
article_raw$month <- as.numeric(format(as.Date(article_raw$Date), "%m"))
article_raw$month <- as.factor(article_raw$month)
Engagement figures should be aggregated by Date, Source, and Topic for time series analysis.
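A sketch of this aggregation, using toy rows in place of `article_raw`; the same pipeline produces the `article_daily` object used in the rest of the analysis.

```r
library(dplyr)

# Toy rows standing in for article_raw.
toy_raw <- tibble::tibble(
  Date   = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02")),
  Source = "Mothership",
  Topic  = "Crime",
  Total  = c(100, 300, 50)
)

# Average the per-article engagement into one daily figure
# per Date/Source/Topic combination.
article_daily_toy <- toy_raw %>%
  group_by(Date, Source, Topic) %>%
  summarise(mean_interaction = mean(Total, na.rm = TRUE), .groups = "drop")

article_daily_toy$mean_interaction   # 200 50
```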
We can create a categorical summary of the data, showing variable names and frequencies, using the table() function. As the table below shows, each media/topic combination has a different number of values. Each value represents the number of days on which a given media site published articles on a certain topic. For example, Channel News Asia published “World News” articles on 366 days, whereas MSN published “Global Politics” articles on only 20 days. This imbalance affects how the ANOVA test is carried out.
count_table <- table(article_daily$Topic, article_daily$Source)
count_table
| Topic | Asiaone | Channel News Asia | Mothership | MSN | The Independent Singapore | The Straits Times |
|---|---|---|---|---|---|---|
| Covid-Affected Places | 276 | 322 | 336 | 258 | 71 | 354 |
| Covid-Air Travel | 254 | 362 | 201 | 75 | 109 | 359 |
| Covid-Daily Cases | 236 | 358 | 336 | 231 | 101 | 359 |
| Covid-Domitory Cases | 229 | 296 | 233 | 166 | 111 | 343 |
| Covid-Economy | 284 | 363 | 181 | 60 | 152 | 366 |
| Covid-Education | 275 | 335 | 258 | 139 | 105 | 363 |
| Covid-Food Delivery | 313 | 306 | 334 | 244 | 112 | 338 |
| Covid-Outbreaks | 279 | 354 | 156 | 75 | 129 | 358 |
| Crime | 294 | 330 | 288 | 140 | 135 | 345 |
| Entertainment News | 345 | 325 | 280 | 134 | 211 | 344 |
| Financial Markets | 213 | 355 | 115 | 33 | 92 | 362 |
| Global Politics | 171 | 363 | 169 | 20 | 181 | 361 |
| Malaysia | 253 | 340 | 210 | 110 | 172 | 361 |
| Public Transport | 290 | 335 | 264 | 155 | 141 | 350 |
| SG Accidents | 286 | 257 | 327 | 242 | 153 | 336 |
| SG Elections | 153 | 208 | 201 | 105 | 294 | 283 |
| SG Fake News | 247 | 317 | 238 | 99 | 217 | 348 |
| SG Policy | 237 | 324 | 197 | 99 | 143 | 361 |
| Sports | 218 | 343 | 162 | 53 | 130 | 360 |
| World News | 262 | 366 | 162 | 50 | 74 | 358 |
article_daily_2 <- article_raw %>%
group_by(Source, Topic) %>%
summarise(mean_interaction = mean(Total, na.rm= TRUE))
article_daily_2$mean_interaction <- round(article_daily_2$mean_interaction, 0)
mean_table <- dcast(article_daily_2, Topic~Source)
mean_table
| Topic | Asiaone | Channel News Asia | Mothership | MSN | The Independent Singapore | The Straits Times |
|---|---|---|---|---|---|---|
| Covid-Affected Places | 295 | 2179 | 4016 | 2848 | 201 | 1072 |
| Covid-Air Travel | 147 | 1145 | 3491 | 2094 | 126 | 851 |
| Covid-Daily Cases | 113 | 3507 | 1974 | 1391 | 171 | 3251 |
| Covid-Domitory Cases | 265 | 3260 | 4000 | 2241 | 415 | 2840 |
| Covid-Economy | 117 | 843 | 2173 | 1731 | 376 | 377 |
| Covid-Education | 171 | 1571 | 2507 | 3197 | 321 | 1103 |
| Covid-Food Delivery | 353 | 1323 | 4385 | 2769 | 417 | 1075 |
| Covid-Outbreaks | 453 | 1226 | 2549 | 3535 | 474 | 945 |
| Crime | 269 | 2067 | 3159 | 1751 | 125 | 1233 |
| Entertainment News | 316 | 1092 | 3698 | 3747 | 176 | 775 |
| Financial Markets | 99 | 602 | 2461 | 1333 | 132 | 414 |
| Global Politics | 111 | 641 | 1521 | 1506 | 91 | 328 |
| Malaysia | 212 | 1993 | 3684 | 2565 | 244 | 1417 |
| Public Transport | 324 | 1486 | 3757 | 2099 | 569 | 1443 |
| SG Accidents | 427 | 2555 | 4592 | 2436 | 307 | 1376 |
| SG Elections | 254 | 2002 | 2079 | 1195 | 510 | 954 |
| SG Fake News | 179 | 1702 | 2219 | 1523 | 372 | 501 |
| SG Policy | 206 | 1263 | 2657 | 1658 | 572 | 649 |
| Sports | 147 | 666 | 2312 | 2097 | 75 | 394 |
| World News | 198 | 1049 | 4037 | 1893 | 101 | 769 |
As seen in the table above, the average daily interactions seem to differ by media site and by topic. For a more accurate analysis, an ANOVA (Analysis of Variance) test is needed to determine whether there are statistically significant differences between the means. The results of the ANOVA test will shape the anomaly detection analysis.
In order to perform the ANOVA test, we first need to check for normality. Depending on the results, the functions and parameters for the ANOVA test will be determined.
ggdensity(article_daily$mean_interaction,
xlab = "Average daily interaction") +
facet_wrap(~ article_daily$Source)

p <- ggdensity(article_daily$mean_interaction,
xlab = "Average daily interaction") +
facet_wrap(~ article_daily$Topic)
p

ggqqplot(article_daily$mean_interaction) +
facet_wrap(~ article_daily$Source)

ggqqplot(article_daily$mean_interaction) +
facet_wrap(~ article_daily$Topic)

From the aggregated data set and the visualisation graphs above, we can summarise that: (1) the distributions of average daily interactions are heavily right-skewed and deviate from normality in both the density and Q-Q plots, and (2) the number of observations per media/topic group is unbalanced. Therefore, a non-parametric ANOVA test will be conducted.
The ggbetweenstats() function from the ggstatsplot package plots the ANOVA test result. We will set type = "np" for a non-parametric test and pairwise.comparisons = TRUE to compare within the groups.
# for reproducibility
set.seed(123)
# plot
p1 <- ggstatsplot::ggbetweenstats(
data = article_daily %>% filter(Topic == 'Covid-Air Travel'),
x = Source,
y = mean_interaction,
mean.plotting = TRUE,
mean.ci = TRUE,
pairwise.comparisons = TRUE, # display results from pairwise comparisons
notch = TRUE,
type = "np",
title = "Differences of daily interaction by Topic 'Covid-Air Travel'",
messages = FALSE
)
p1

The null hypothesis of the ANOVA test is that the median values of the average daily interactions are the same across the media sites. As seen in the ANOVA plots above, the test results are statistically significant at the 95% confidence level (p-value < 0.05), which suggests that the median values are not the same across the media sites.
In conclusion, engagement on social media differs by topic and by media site. Therefore, we should split the data set by media site and topic when we perform anomaly detection.
When an article draws more interest, it receives more engagement than others, i.e. higher engagement figures. We can see the daily trend of interactions by topic in the charts below ("Daily trend of the average interactions by each topic"). However, it is extremely difficult to identify when and which articles unusually intrigued people just by looking at the charts. For this reason, we will use time series anomaly detection techniques to find unusual patterns among the articles.
We can see detailed information when hovering over the line graph. The peak points could be unusual dates; however, more statistical approaches are needed to confirm which ones are anomalous.
p_a <- ggplot(data = article_daily %>% filter(Topic=='Covid-Air Travel'),
aes(x = Date, y = mean_interaction, color = Source)) +
geom_line()
p_a<- p_a + ggtitle("Daily trend of the average interaction: Topic 'Covid-Air Travel'")
fig_a <- ggplotly(p_a)
fig_a
The R ‘timetk’ package helps to plot, manipulate, and forecast time series data in R.
For analysis purposes, we will use the filtered data set (Topic == ‘Covid-Affected Places’ & Source == ‘The Straits Times’), then select only the “Date” and “mean_interaction” columns to fit the tibble format. If you want to explore another topic or media site, you can change the values of “topic_filter” or “Source_filter”.
topic_filter = 'Covid-Affected Places'
Source_filter = 'The Straits Times'
df_tmp <- article_daily %>%
filter(Topic == topic_filter & Source == Source_filter)%>%
select(Date,mean_interaction)
df_tibble <- as_tibble(df_tmp)
class(df_tibble)
[1] "tbl_df" "tbl" "data.frame"
The plot_anomaly_diagnostics() function is a visualisation wrapper for tk_anomaly_diagnostics(), which performs group-wise anomaly detection using a two-step process to detect outliers in time series.
The decomposition separates the “season” and “trend” components from the “observed” values, leaving the “remainder” for anomaly detection. Once “trend” and “season” (seasonality) are removed, anomaly detection is performed on the “remainder”. Anomalies are identified, and boundaries (recomposed_l1 and recomposed_l2) are determined.
The anomaly detection method uses an inner quartile range (IQR) of ±25% around the median.
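A simplified sketch of the idea behind these IQR-based bands (not timetk's exact internals; the scaling factor here is illustrative):

```r
# Flag remainder values outside median +/- an IQR-scaled band.
# The 0.15/alpha scaling is illustrative, not timetk's exact formula.
flag_iqr_outliers <- function(remainder, alpha = 0.05) {
  q      <- quantile(remainder, probs = c(0.25, 0.75))
  iqr    <- q[2] - q[1]
  limit  <- (0.15 / alpha) * iqr   # smaller alpha -> wider band -> fewer anomalies
  center <- median(remainder)
  remainder < center - limit | remainder > center + limit
}

flag_iqr_outliers(c(1:20, 100))   # only the spike at 100 is flagged
```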
df_tibble %>% timetk::plot_anomaly_diagnostics(Date,mean_interaction, .facet_ncol = 3)
In order to extract the anomalous dates with the average daily interaction figures, we can use the tk_anomaly_diagnostics() function. In the table, the “observed” value represents the average daily interaction figure and the “trend” value represents the normal level around that date.
anomal_list <- df_tibble %>% timetk::tk_anomaly_diagnostics(Date, mean_interaction) %>% filter(anomaly=='Yes')
anomal_list
# A tibble: 22 x 11
Date observed season trend remainder seasadj remainder_l1
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2020-03-17 6344. -128. 403. 6070. 6472. -2624.
2 2020-03-18 5019. 132. 403. 4484. 4887. -2624.
3 2020-03-31 10592 -128. 410. 10311. 10720. -2624.
4 2020-04-06 10335. -79.8 416. 9998. 10415. -2624.
5 2020-04-09 13655. -18.1 420. 13253. 13673. -2624.
6 2020-04-20 5285 -79.8 430. 4934. 5365. -2624.
7 2020-04-24 5031 37.5 433. 4561. 4994. -2624.
8 2020-04-25 4884 -111. 432. 4562. 4995. -2624.
9 2020-04-28 3831 -128. 431. 3528. 3959. -2624.
10 2020-05-06 6040. 132. 430. 5478. 5908. -2624.
# … with 12 more rows, and 4 more variables: remainder_l2 <dbl>,
# anomaly <chr>, recomposed_l1 <dbl>, recomposed_l2 <dbl>
From the anomal_list data set, we can also find detailed information, such as article headlines and engagement figures, in the raw data set. With this detail, we can see which articles strongly affected the public on a given day.
# Select anomalous date
date_filter <- anomal_list$Date[1]
# Find trend value (normal average of daily interaction)
anomal_list2 <- anomal_list %>% filter(Date == date_filter)
trend <- round(anomal_list2$trend,0)
# Create anomalous article list with Topic, Source, Date, Headline, Total(Engagement), and Trend columns
anomal_article <- article_raw %>%
filter(Topic == topic_filter, Source == Source_filter, Date == date_filter) %>%
select(Topic, Source, Date, Headline, Total)
anomal_article$Trend <- trend
datatable(anomal_article)
As seen above, this analysis involves many subsets of the data. The raw data set contains six unique media sites and 20 different topics from 2020-01-01 to 2020-12-31. When we want to find a specific piece of data or analysis result, static visualisations have many limitations. For example, it would be tedious to change the filter values every time and compare the results of the analysis. If we apply interactive visualisation techniques, selecting the filter values becomes much easier and more flexible. Interactive visualisation is also useful for showing the results of the analysis in a single page view: showing everything analysed above would make the report extremely long, so interactive visualisation helps readers see the results at a glance.